Abstract

We consider the generalization problem for a perceptron with binary synapses, implementing the Stochastic
Belief-Propagation-Inspired (SBPI) learning algorithm which we proposed earlier, and perform a mean-field calculation to
obtain a differential equation which describes the behaviour of the device in the limit of a large number of synapses $N$.
We show that the solving time of SBPI is of order $N\sqrt{\log N}$, while the similar, well-known clipped perceptron (CP)
algorithm does not converge to a solution at all in the time frame we considered. The analysis gives some insight into
the ongoing process and shows that, in this context, the SBPI algorithm is equivalent to a new, simpler algorithm,
which only differs from the CP algorithm by the addition of a stochastic, unsupervised meta-plastic reinforcement
process, whose rate of application must be less than $\sqrt{2/(\pi N)}$ for the learning to be achieved effectively.
The analytical results are confirmed by simulations.

PACS numbers: 87.18.Sn, 84.35.+i, 05.10.-a
subclass: 68T05, 68T15, 82C32 
1 Introduction

The perceptron was first proposed by Rosenblatt [T5] as an extremely simplified model of a neuron, consisting 
in a number $N$ of input lines, each one endowed with a weight coefficient (representing the individual synaptic
conductances), all of which converge in a central unit (representing the soma) with a single output line (the 
axon). Typically, the output is computed from a threshold function of the weighted sum of the inputs, and 
the time in the model is discretized, so that, at each time step, the output depends only on the input at
that time step. The unit can adapt its behaviour over time by modifying the synaptic weights (and possibly 
the output threshold), and thus it can undergo a learning (or memorizing) process. In this paper, we only 
consider the case in which the learning process is "supervised", i.e. in which there is some feedback from the 
outside, typically in the form of an "error signal", telling the unit that the output is wrong, as opposed to
"unsupervised" learning, in which no feedback is provided to the unit. 

Despite its simplicity, the perceptron model is very powerful, being able to process its inputs in parallel, 
to retain an extensive amount of information, and to plastically adapt its output over time in an on-line 
fashion. Furthermore, it displays a highly non-trivial behaviour, so that much research has been devoted 
to the study of its analytical properties and of the optimal learning strategies in different contexts (see 
e.g. [TJ H3 [7J GO HH H31 Ull and references therein). In the supervised case, there are typically two scenarios: 
the first, which we will call the "classification" problem in the rest of this paper, is defined by a given set of
input-output associations that the unit must learn to reproduce without errors, while the second, which we will
call the "generalization" problem, is defined by a given input-output rule that the unit must learn to implement
as closely as possible. Here, we will mainly focus our attention on the latter problem.

Furthermore, we will restrict ourselves to the case in which the synaptic weights are assumed to be binary variables.
Binary models are inherently simpler to implement and more robust over time against noise with respect 
to models in which the synaptic weights are allowed to vary over a continuous set of values, while having 
a comparable information storage capacity [6l l9l 1X3] ; furthermore, some recent experimental results [T5l IT6]. 
as well as some arguments from theoretical studies and computer simulations^! EH [TU1 I14J . suggest that 
binary-synapses models could also be more relevant than continuous ones as neuronal models exhibiting 
long term plasticity. However, from an algorithmic point of view, learning is much harder in models with
binary synapses than in models with continuous synapses: in the worst-case scenario, the classification
learning problem is known to be NP-complete for binary weights [4], while it is easy to solve it effectively
with continuous weights [6]. Even in the case of random, uncorrelated input-output associations, the solution
space of the classification problem in the binary case is in a broken-symmetry phase, while in the continuous
case it is not [13], implying that the learning strategies which successfully solve the learning problem in the
latter case are normally not effective in the former.

Despite these difficulties, an efficient, easily implementable, on-line learning algorithm can be devised
which solves the binary classification problem in the case of random, uncorrelated input stimuli [1].
Such an algorithm was originally derived from the standard Belief Propagation algorithm^ [TTJ, (2U|, and 
hence named 'Stochastic Belief Propagation-Inspired' (SBPI). The SBPI algorithm makes an additional 
requirement on the model, namely, that each synapse in the device, besides the weight, has an additional 
hidden, discretized internal state; transitions between internal states may be purely meta-plastic, meaning 
that the synaptic strength does not necessarily change in the process, but rather that the plasticity of the 
synapse does. The SBPI learning rules and hidden states requirements are the same as those of the well
known clipped perceptron algorithm (CP, see e.g. [17]), the only difference being an additional, purely
meta-plastic rule, which is only applied if the answer given by the device is correct, but such that a single variable
flip would result in a classification error.

The SBPI algorithm was derived and tested in the context of the classification problem: in such a scheme,
all the input patterns are extracted from a given pattern set, randomly generated before the learning session
takes place, and presented to the student device repeatedly, the outcome being compared to the desired one,
until no classification errors are made any more. Since the analytical treatment of the learning process is
awkward in such a case, due to the temporal correlations emerging in the input patterns as a consequence of
the repeated presentations, we could only test the SBPI algorithm performance by simulations, and compare
it to that of other similar algorithms, such as the CP algorithm and the cascade model [Jj. It turned out 
that the additional learning rule which distinguishes the SBPI algorithm from the CP algorithm is essential
to SBPI's good performance, and that there exists an optimal value for both the number of
internal hidden states per synapse and the rate of application of the novel learning rule.

In order to understand the reason for the effectiveness of the new SBPI rule, it is necessary to give an analytical
description of the learning process under the CP and SBPI learning rules; to this end, we consider here the
problem of generalization from examples: in such a scheme, the input patterns are generated afresh at each
time step, and the goal is to learn a linearly separable classification rule, provided by a teacher device. Perfect
learning is achieved if the student's synapses match the teacher's ones. Using a teacher unit identical to the
student to define the goal input-output function ensures that a solution to the problem exists and is unique,
but we also briefly consider the case of a rule that cannot be learned perfectly, provided by a continuous-weight teacher
perceptron. The learning process in this case is much easier to treat analytically than in the classification
case, since the input patterns are not temporally correlated, and the dynamical equations for this system can 
be derived by a mean field calculation for the case of a learnable rule. This problem is in fact easier to address 
than the classification problem, and optimal algorithms can be found which solve it in the binary synapses 
case as well (see e.g. [H [T71 US]); however, such algorithms are not suitable to be considered as candidates 
for biological models for online learning, being either too complex or requiring to perform all intermediate 
operations with an auxiliary device with continuous synaptic weights. 

The resulting differential equation set that we obtained gives some insight into the learning dynamics and
into the reason for SBPI's effectiveness, and allows for a further simplification of the SBPI algorithm,
yielding an even more attractive model of neuronal unit, both from the point of view of biological feasibility
and of hardware manufacturing design simplicity. With a special choice for the parameters, the solution to
the equation set is simple enough to be studied analytically and to demonstrate that the algorithm converges
in a number of time steps which goes as $N\sqrt{\log N}$. All the results are confirmed by simulations.

The outline of the rest of this paper is as follows: in Sects. 2 and 3 we define in detail the learning
algorithm and the generalization problem, respectively. In Sects. 4 and 5 we derive the mean-field dynamics
for the CP and SBPI algorithms, and in Sect. 6 we derive the set of continuous differential equations which
describes the process in the $N \to \infty$ limit and exhibit a solution. In Sect. 7 we consider a special case in
which the equation set can be simplified and derive some analytical results on convergence time in such a case.
In Sect. 8 we consider the case of bounded hidden states. In Sect. 9 we briefly consider the case of a
non-learnable rule. In Sect. 10 we discuss the simplified algorithm derived in Sect. 5. We summarize our results
in the last section.

2 The SBPI learning algorithm 

The device we consider is a binary perceptron with $N$ synapses, each of which can take the values $w_i = \pm 1$,
receiving inputs $\xi_i^\mu = \pm 1$, with output $\sigma^\mu = \pm 1$, and threshold $\theta = 0$. Thus, the device output is given as a
function of the inputs and of the internal state as

$$\sigma^\mu = \mathrm{sign}\left(\sum_{i=1}^{N} w_i \xi_i^\mu\right)$$

Furthermore, each synapse is endowed with a discretized internal variable $h_i$, which only plays an active
role during the learning process; for simplicity, we will consider it to be an odd-valued integer. At any given
time, the sign of this quantity gives the value of the corresponding synaptic weight, $w_i = \mathrm{sign}(h_i)$. We will
start by considering the case of unbounded hidden states, and then turn to the bounded case.

SBPI is an on-line supervised learning algorithm; upon presentation of a pattern $\{\xi^\mu, \sigma_D^\mu\}$, where $\sigma_D^\mu$ is
the desired output, the stability is computed as

$$\Delta^\mu = \sigma_D^\mu \sum_{i=1}^{N} w_i \xi_i^\mu$$

The way synaptic weights are updated depends on the value of $\Delta^\mu$:

1. If $\Delta^\mu > \theta_m$, then nothing is done

2. If $0 < \Delta^\mu < \theta_m$, then only the synapses for which $w_i = \xi_i^\mu \sigma_D^\mu$ are updated, and only with probability $p_s$

3. If $\Delta^\mu < 0$, then all synapses are updated

Here, $\theta_m$ is a secondary threshold, expressed as an even integer, and $p_s \in [0, 1]$. The update rule applies to
the hidden synaptic variables:

$$h_i \to h_i + 2 \xi_i^\mu \sigma_D^\mu$$

The factor 2 is required in order to keep the value of the hidden variables odd, which in turn is useful for
avoiding the ambiguous, but otherwise immaterial, $h_i = 0$ case. Note that the only actual plasticity events
occur when the hidden variables change sign; also, the update in rule 2 is always in the direction of increasing 
the hidden variables' modulus, thus reinforcing the synaptic value by making it less likely to switch. 

When the probability $p_s$ or, equivalently, the secondary threshold $\theta_m$ is set to 0, rule 2 is never
applied and the algorithm reduces to the CP algorithm.

In the special case $p_s = 1$ and $\theta_m = 2$, we refer to the algorithm as BPI.
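
The three update rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the simulations in this paper: the function name `sbpi_step`, its return labels and the use of plain lists are assumptions made here, and $N$ is assumed odd so that the stability is never zero.

```python
import random

def sbpi_step(h, xi, sigma_d, theta_m=2, p_s=1.0, rng=random):
    """Apply one SBPI update in place to the odd-integer hidden states h.

    Hypothetical helper: returns a label telling which rule was applied.
    """
    # Synaptic weights are the signs of the hidden variables (h is never 0).
    w = [1 if hi > 0 else -1 for hi in h]
    # Stability of the pattern with respect to the desired output.
    delta = sigma_d * sum(wi * x for wi, x in zip(w, xi))
    if delta > theta_m:
        return "none"                      # rule 1: nothing is done
    if delta > 0:
        # Rule 2: with probability p_s, reinforce only the synapses that
        # already agree with the pattern (a purely meta-plastic step).
        if rng.random() < p_s:
            for i, (wi, x) in enumerate(zip(w, xi)):
                if wi == x * sigma_d:
                    h[i] += 2 * sigma_d * x
            return "reinforce"
        return "none"
    # Rule 3: classification error, every synapse is updated.
    for i, x in enumerate(xi):
        h[i] += 2 * sigma_d * x
    return "error"
```

Setting `theta_m=0` (or `p_s=0.0`) in this sketch disables the second branch, which corresponds to the CP algorithm.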

3 Definition of the generalization learning problem 

The protocol which was originally used to obtain the SBPI update rules was that of classification of random
patterns extracted from a given set; learning of the correct classification was achieved by repeated
presentations of the patterns from the set and application of the update rules. The maximum number of input-output
associations that the system could memorize in this way was shown by simulations to be proportional to the
number of synapses $N$, the coefficient of proportionality being fairly close to the maximal theoretical value,
with an order $O\left((\log N)^{1.5}\right)$ presentations per pattern required on average.

Here instead we will consider the problem of learning a rule from a teacher perceptron, identical to the 
student (the case of a different teacher device being considered in Sec. 9); the patterns are generated at random
at each time step, each input $\xi_i$ being extracted independently with probability $P(\xi_i = +1) = P(\xi_i = -1) = 1/2$,
and the desired output is given by the teacher. Thus, the goal is to reach a perfect overlap with the
teacher, an event which can be thought of as the student having learned an association rule. An optimal
learning algorithm for this problem, which reaches the solution in about $1.245N$ steps in the limit of large
N, can be derived by the Bayesian approach[I5] (which is equivalent to the Belief Propagation approach 
[5] in this case); however, this optimal algorithm does not work in an on-line fashion, as it requires keeping
in memory every pattern which was presented thus far to the device. An on-line approximation of the
optimal algorithm, proposed in [19] and later re-derived from a different approach in [1] as an intermediate
step towards SBPI, overcomes this problem at the expense of a lower performance, but it still requires the
internal storage of continuous quantities, and complex computations to be performed at each time step.

In order to simplify the notation in the rest of this paper, we will assume that the student is always
trained only on patterns whose desired output is $+1$, which can be ensured in this way: at each time $\tau$ a new
pattern $\{\chi_i^\tau\}_i$ is generated randomly and presented to the teacher, whose output is $\sigma_T^\tau$; then, the pattern
$\{\xi_i^\tau\} = \{\sigma_T^\tau \chi_i^\tau\}$ is presented to the student, with desired output $\sigma_D^\tau = +1$. Also, we can assume, without loss
of generality, that all the teacher's synapses are set to $w_i^T = +1$. This implies that the student will only be
presented patterns in which there are more positive than negative inputs.

In the following, we shall show that it is possible to describe the average learning dynamics and estimate
the time needed for the student to reach overlap 1 with the teacher, $q = \frac{1}{N}\left(w \cdot w^T\right) = 1$.
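
As an illustration of this protocol, the following Python sketch draws a teacher-aligned pattern and measures the overlap with an all-$(+1)$ teacher. The helper names `teacher_aligned_pattern` and `overlap` are hypothetical and introduced only for this example, and the number of synapses is assumed odd so that the teacher output is never ambiguous.

```python
import random

def teacher_aligned_pattern(n, rng=random):
    """Draw a uniform random +/-1 pattern and flip it so that the
    all-(+1) teacher classifies it as +1 (n is assumed odd)."""
    chi = [rng.choice((-1, 1)) for _ in range(n)]
    sigma_t = 1 if sum(chi) > 0 else -1   # teacher output on chi
    return [sigma_t * c for c in chi]     # desired output is now +1

def overlap(h):
    """Overlap q with the all-(+1) teacher, computed from the hidden states."""
    return sum(1 if hi > 0 else -1 for hi in h) / len(h)
```

By construction every returned pattern contains more positive than negative inputs, and `overlap` equals 1 only when all hidden variables are positive.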
4 Histogram dynamics for the CP algorithm

We will use a mean-field-like approximation to the problem: at each time step, given the histogram of the
hidden variables at a time $\tau$, $P^\tau(h)$, we compute the average distribution (over the input patterns) at
time $\tau + 1$, $P^{\tau+1}(h)$, and iterate. The approximation here resides in the fact that, at each step, what we
obtain is an average quantity (a single histogram), which we use as input in the following step, while a more
complete description would involve the evolution of the whole probability distribution over all the possible
resulting histograms. Therefore, we are implicitly assuming that the spread of such probability distribution
around its average is negligible; our results confirm this assumption.

We will start from the simpler case of the CP algorithm (no rule 2), and temporarily drop the index $\tau$.

Let us first compute the probability of making a classification error. This only depends on the current 
teacher-student overlap $q$. We will denote by $q_+$ ($q_-$) the fraction of student synapses which are set to $+1$
($-1$), so that the overlap is $q = q_+ - q_- = 2q_+ - 1$. In the following, we have to consider separately the $+1$
and $-1$ synapses: we denote by $\nu_+$ the number of positive inputs over the positive synapses, and by $\nu_-$ the
number of positive inputs over the negative synapses. Because of the constraint on the patterns, there have
to be more positive inputs than negative ones, i.e. $\nu_+ + \nu_- > \frac{N}{2}$. The perceptron will classify the pattern
correctly if $\nu_+ + \left(q_- N - \nu_-\right) > \frac{N}{2}$, thus the probability that the student makes an error is given by

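$$p_e = 2 \int d\mu(\nu_+)\, d\mu(\nu_-)\; \Theta\left(\nu_+ + \nu_- - \frac{N}{2}\right) \Theta\left(\frac{N}{2} - \nu_+ - \left(q_- N - \nu_-\right)\right)$$

where $\mu(\nu_\pm)$ is the measure over $\nu_\pm$ without the constraint on the pattern (which is explicitly obtained by
cutting half of the cases and renormalizing). In the large $N$ limit, this is a normal distribution, centered on
$q_\pm N/2$ with variance $q_\pm N/4$, thus we can write the above probability as

$$p_e = 2 \int Dx_+\, Dx_-\; \Theta\left(\sqrt{q_+}\, x_+ + \sqrt{q_-}\, x_-\right) \Theta\left(-\sqrt{q_+}\, x_+ + \sqrt{q_-}\, x_-\right) = \frac{1}{\pi} \arccos(q) \tag{1}$$

where we used the shorthand notation $Dx = \frac{dx}{\sqrt{2\pi}} e^{-x^2/2}$ (eq. (1) is the standard relation between the
generalization error and the teacher-student overlap in perceptrons, see e.g. [5]).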
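
The relation $p_e = \frac{1}{\pi}\arccos(q)$ can be checked numerically with a small Monte Carlo experiment. The sketch below is only illustrative: the function names are hypothetical, the teacher is taken to be all $+1$, the student is assumed to have its first `n_minus` synapses set to $-1$, and $N$ is assumed odd so that neither the teacher field nor the student field can vanish.

```python
import math
import random

def error_rate_mc(n, n_minus, trials, rng):
    """Monte Carlo estimate of the error probability of a student with
    n_minus synapses at -1 against an all-(+1) teacher (n odd)."""
    mask = (1 << n_minus) - 1          # selects the inputs on the -1 synapses
    errors = 0
    for _ in range(trials):
        bits = rng.getrandbits(n)      # bit set <=> input equal to +1
        s_all = 2 * bin(bits).count("1") - n
        s_minus = 2 * bin(bits & mask).count("1") - n_minus
        field = s_all - 2 * s_minus    # student field: flip the -1 synapses
        if (s_all > 0) != (field > 0): # teacher and student disagree
            errors += 1
    return errors / trials

def error_rate_theory(q):
    """Large-N prediction p_e = arccos(q) / pi."""
    return math.acos(q) / math.pi
```

For example, comparing `error_rate_mc(201, 50, 20000, random.Random(1))` with `error_rate_theory(1 - 2 * 50 / 201)` should give two values that agree up to statistical fluctuations and finite-size corrections.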

We then focus on a synapse with negative value, and compute the probability that there is an error and 
that the synapse receives a positive input: 

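$$P\left(\Delta < 0 \wedge \xi_i = 1 \mid w_i = -1\right) = 2 \int Dx_+\, Dx_- \left(\frac{1}{2} + \frac{x_-}{2\sqrt{q_- N}}\right) \Theta\left(\sqrt{q_+}\, x_+ + \sqrt{q_-}\, x_-\right) \Theta\left(-\sqrt{q_+}\, x_+ + \sqrt{q_-}\, x_-\right)$$

$$= \frac{p_e}{2} + \frac{1}{\sqrt{2\pi N}} + o\left(\frac{1}{\sqrt{N}}\right)$$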
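The probability that a negative-valued synapse receives a negative input, and that an error is made, is
very similar:

$$P\left(\Delta < 0 \wedge \xi_i = -1 \mid w_i = -1\right) = \frac{p_e}{2} - \frac{1}{\sqrt{2\pi N}} + o\left(\frac{1}{\sqrt{N}}\right)$$

The probabilities for positive-valued synapses are simpler:

$$P\left(\Delta < 0 \wedge \xi_i = \pm 1 \mid w_i = +1\right) = \frac{p_e}{2} + o\left(\frac{1}{\sqrt{N}}\right)$$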

Therefore, a positive-valued synapse (which is thus correctly set with respect to the teacher) has an equal
probability of switching up or down one level, while a negative-valued one (which is thus wrongly set) has 
a higher probability of switching up than down. The histogram dynamics can be written in a first-order 
approximation as: 



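$$\begin{aligned}
P^{\tau+1}(h) = {} & P^\tau(h) \left[1 - p_e^\tau\right] \\
& + P^\tau(h+2) \left[\frac{p_e^\tau}{2} - \frac{\Theta\left(-(h+2)\right)}{\sqrt{2\pi N}}\right] \\
& + P^\tau(h-2) \left[\frac{p_e^\tau}{2} + \frac{\Theta\left(-(h-2)\right)}{\sqrt{2\pi N}}\right]
\end{aligned} \tag{2}$$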



where, as usual, the $h$'s are assumed to be odd. It can be easily verified that normalization is preserved
by this equation. 

Note that, if $p_e$ is very small, $\frac{p_e}{2} - \frac{1}{\sqrt{2\pi N}}$ may become negative, which is meaningless; in terms of the
overlap, this happens when $q_- N < \frac{\pi}{2}$, i.e. when convergence is reached up to just one or two synapses (in
fact, this never happens with the CP algorithm, which does not appear to ever converge to the solution
or to even get close to convergence). This is due to the fact that the gaussian approximation we used is no
longer valid when $q_-$ is of order $N^{-1}$; note however that this is not really an issue for practical purposes,
as simulations show that whenever the algorithm gets into this region, convergence is eventually reached in
a short time.



5 Histogram dynamics for the SBPI algorithm 

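We now turn to SBPI. We have to compute the probability that the new rule 2 is applied, which happens
when $0 < \Delta < \theta_m$ with probability $p_s$; thus:

$$p_b = 2 p_s \int Dx_+\, Dx_-\; \Theta\left(\sqrt{q_+}\, x_+ + \sqrt{q_-}\, x_-\right) \Theta\left(\sqrt{q_+}\, x_+ - \sqrt{q_-}\, x_-\right) \Theta\left(\frac{\theta_m}{\sqrt{N}} - \sqrt{q_+}\, x_+ + \sqrt{q_-}\, x_-\right)$$

$$= \frac{p_s \theta_m}{\sqrt{2\pi N}} + o\left(\frac{1}{\sqrt{N}}\right) \tag{3}$$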



The leading term is of order $N^{-1/2}$, so there's no need to distinguish between positive and negative synapses
here, because the difference between the two cases is of order $N^{-1}$. Thus, each synapse has a probability $p_b/2$
of moving away from 0 and a probability $p_b/2$ of standing still, since only half of the synapses are involved
in rule 2 each time it is applied.
We note that the result does not depend on the internal state of the device: it is a constant, acting for
both positive and negative synapses. Furthermore, we see that we can reduce the number of parameters by
defining

$$k = p_s \theta_m \tag{4}$$

This means that, in the generalization context and in the limit of large $N$, rule 2 in the SBPI algorithm
can be substituted by a stochastic, generalized and unsupervised reinforcement process. We shall come back
to this issue in Sec. 10.

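Using eq. (3) we can add rule 2 to eq. (2), getting the full SBPI dynamics:

$$P^{\tau+1}(h) = P^\tau(h)\, J_0^\tau + P^\tau(h+2)\, J_-^\tau(h+2) + P^\tau(h-2)\, J_+^\tau(h-2) \tag{5}$$

where

$$J_0^\tau = 1 - p_e^\tau - \frac{k}{2\sqrt{2\pi N}}$$

$$J_-^\tau(h) = \frac{p_e^\tau}{2} - \Theta(-h) \frac{1}{\sqrt{2\pi N}} + \Theta(-h) \frac{k}{2\sqrt{2\pi N}} \tag{6}$$

$$J_+^\tau(h) = \frac{p_e^\tau}{2} + \Theta(-h) \frac{1}{\sqrt{2\pi N}} + \Theta(h) \frac{k}{2\sqrt{2\pi N}}$$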

The agreement between this formula and the simulations is almost perfect, except when the average
number of wrong synapses is very low, i.e. when $q_- N$ is of order 1, as can be seen in Fig. 2.
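
One step of this histogram dynamics can be sketched as follows. The sketch is a reading of the derivation above rather than the code actually used for the figures: it assumes that a synapse at $h < 0$ moves up with probability $p_e/2 + 1/\sqrt{2\pi N}$ and down with probability $p_e/2 - 1/\sqrt{2\pi N} + k/(2\sqrt{2\pi N})$, that a synapse at $h > 0$ moves up with probability $p_e/2 + k/(2\sqrt{2\pi N})$ and down with probability $p_e/2$, and that $p_e = \arccos(q)/\pi$. The names `histogram_step` and `histogram_overlap` are hypothetical.

```python
import math

def histogram_overlap(P):
    """Overlap q = q_plus - q_minus of a histogram P mapping odd h -> probability."""
    return sum(p if h > 0 else -p for h, p in P.items())

def histogram_step(P, n, k):
    """One mean-field step of the hidden-state histogram (k = 0 gives CP)."""
    c = 1.0 / math.sqrt(2.0 * math.pi * n)
    q = max(-1.0, min(1.0, histogram_overlap(P)))   # guard against rounding
    p_e = math.acos(q) / math.pi                    # error probability
    new = {}
    for h, p in P.items():
        if h < 0:
            up = p_e / 2 + c                  # wrong synapses are pushed up
            down = p_e / 2 - c + k * c / 2    # reinforcement pushes away from 0
        else:
            up = p_e / 2 + k * c / 2          # reinforcement pushes away from 0
            down = p_e / 2
        stay = 1.0 - up - down
        # Note: 'down' can turn negative when p_e is tiny and k < 2; the
        # gaussian approximation is not valid in that regime.
        for target, prob in ((h, stay), (h + 2, up), (h - 2, down)):
            new[target] = new.get(target, 0.0) + p * prob
    return new
```

Iterating `histogram_step` from `{-1: 0.5, 1: 0.5}` for a number of steps of order $N$ gives the evolution of the overlap, which can then be compared with direct simulations.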

6 Continuous limit 

Equations [S] and [5] can be converted to a continuous equation in the large N limit, by rescaling the variables: 

$$x = \frac{h}{\sqrt{N}}$$

$$t = \frac{\tau}{N} \tag{8}$$

and using a probability density 

$$p(x, t) = \sqrt{N}\, P^{Nt}\left(\sqrt{N} x\right) \tag{9}$$

Note that the $\sqrt{N}$ scaling of the hidden variables is the same which we found empirically in the
classification learning problem [1].

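Using these and taking the limit $N \to \infty$ we get the partial differential equation set:

$$\begin{aligned}
\frac{\partial p}{\partial t}(x, t) = {} & 2 p_e(t) \frac{\partial^2 p}{\partial x^2}(x, t) - \frac{1}{\sqrt{2\pi}} \frac{\partial p}{\partial x}(x, t) \left[(4 - k)\, \Theta(-x) + k\, \Theta(x)\right] \\
& + \delta(x)\, \Theta(-x)\, \gamma^-(t) + \delta(x)\, \Theta(x)\, \gamma^+(t)
\end{aligned} \tag{10}$$

$$p_e(t) = \frac{1}{\pi} \arccos\left(q(t)\right) \tag{11}$$

$$q(t) = 2 \int_0^\infty dx\, p(x, t) - 1 \tag{12}$$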
where $\delta(x)$ represents the Dirac delta function.

The two quantities $\gamma^-(t)$ and $\gamma^+(t)$ don't need to be written explicitly, since they can be specified by
imposing two conditions on the solution, normalization and continuity:

$$\int_{-\infty}^{+\infty} dx\, p(x, t) = 1 \tag{13}$$

$$p(0^-, t) = p(0^+, t) \tag{14}$$

The reason for the continuity requirement, eq. (14), is the following: if there were a discontinuity
at $x = 0$, the net probability flux through that point would diverge, as can be seen by direct inspection of
eq. (5) and considering how the $\tau$ and $h$ variables scale with $N$. Note that, in the BPI case $k = 2$, enforcing
these two constraints simply amounts to setting $\gamma^\pm(t) = 0$, as discussed in the next section.

As a whole, eq. (10) is non-local, since the evolution in each point depends on what happens at $x = 0$;
on the other hand, it greatly simplifies away from that point: on either side of the $x$ axis, it reduces to a
Fokker-Planck equation, with a time-dependent coefficient of diffusion, and a constant drift term. In general,
the drift term is different between the left and right side of the $x$ axis, and depends on $k$; this difference gives
rise to an accumulation of the probability distribution on both sides of the point $x = 0$ (expressed by the two
Dirac delta functions in the equation).

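For negative $x$, equation (10) reads:

$$\frac{\partial p}{\partial t}(x, t) = 2 p_e(t) \frac{\partial^2 p}{\partial x^2}(x, t) - \frac{4 - k}{\sqrt{2\pi}} \frac{\partial p}{\partial x}(x, t) \tag{15}$$

If the initial distribution, at time $t_0$, is a gaussian centered in $x_0$ with variance $v_0$, then the solution to
this equation is a gaussian whose center $x(t)$ and variance $v(t)$ obey the equations:

$$x(t) = x_0 + \frac{4 - k}{\sqrt{2\pi}} (t - t_0) \tag{16}$$

$$v(t) = v_0 + 4 \int_{t_0}^{t} dt'\, p_e(t') \tag{17}$$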

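Let us call $g^-(x, t, t_0)$ such a solution, assuming $x_0 = 0$ and $v_0 = 0$ (i.e. assuming the initial state to be a
Dirac delta centered in 0). We can define in an analogous way a solution to the $x > 0$ branch of equation (10):

$$\frac{\partial p}{\partial t}(x, t) = 2 p_e(t) \frac{\partial^2 p}{\partial x^2}(x, t) - \frac{k}{\sqrt{2\pi}} \frac{\partial p}{\partial x}(x, t) \tag{18}$$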

As before, this equation transforms gaussians into gaussians: the corresponding solution $g^+(x, t, t_0)$ only
differs from $g^-$ in that the centre of the gaussian moves to the right with a velocity proportional to $k$, rather
than $4 - k$.

Overall, this gives a qualitative understanding of what happens during learning: away from $x = 0$, on
both sides there's a diffusion term (the same for both), which tends to 0 if the majority of the synapses gets
to the right side of the $x$ axis. The synapses are 'pushed' right by the drift with 'strength' $k$ on the right side
and $4 - k$ on the left side. Right at $x = 0$, there's a bi-directional flux between the two sides of the solution,
such that the overall area is conserved and that the curve is continuous (even if the derivatives are not).
Thus, it is evident that both $k \le 0$ and $k \ge 4$ are very poor choices (and they include the CP algorithm,
which corresponds to $k = 0$). If the majority of the synapses eventually reaches the right side, the diffusion
stops and the drift dominates. The evolution of the histograms at different times for different values of $k$ is
shown in Fig. 1.

An analytical solution to eq. (10) can be written in terms of the functions defined above: the flux
through $x = 0$ gives rise, in the continuous limit, to the generation of Dirac deltas in the origin, which in turn
behave like gaussians of variance 0 that start to spread and shift. Due to the homogeneity of the equation,
this allows us to write a solution as a weighted temporal convolution of evolving gaussians: first, we write the
initial condition as $p(x, 0) = p_0(x)$; then, we define $p_0^-(x, t)$ as the time evolution of $p_0(x)$ under eq. (15)
and $p_0^+(x, t)$ as the time evolution of $p_0(x)$ under eq. (18) (these can normally be computed easily, e.g. by
means of Fourier transforms). This allows us to write the solution in the form:

$$p(x, t) = \Theta(-x)\, p^-(x, t) + \Theta(x)\, p^+(x, t) \tag{19}$$

where

$$p^\pm(x, t) = p_0^\pm(x, t) + \int_0^t dt'\, \gamma^\pm(t')\, g^\pm(x, t, t')$$

with the constraints given in eqs. (13) and (14). This solution can be verified by direct substitution in eq. (10);
it is not likely to be amenable to further analytical treatment, but it is sufficient for numerical integration,
which indeed shows an almost perfect agreement with the data obtained through histogram evolution at large
$N$, as shown in Fig. 2.

7 Density evolution for BPI 

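In the BPI case, i.e. when $k = 2$, the two sides of eq. (10) are equal; thus, the terms $\gamma^\pm(t)$ both vanish, and
eq. (10) simplifies to:

$$\frac{\partial p}{\partial t}(x, t) = 2 p_e(t) \frac{\partial^2 p}{\partial x^2}(x, t) - \sqrt{\frac{2}{\pi}} \frac{\partial p}{\partial x}(x, t) \tag{20}$$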

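If the initial distribution is a gaussian centered in $x_0$ with variance $v_0$, $p(x, 0) = \frac{1}{\sqrt{v_0}} G\left(\frac{x - x_0}{\sqrt{v_0}}\right)$ (with $G$
the standard normal density), then the evolution of the distribution is described by the following system of equations:

$$p(x, t) = \frac{1}{\sqrt{v(t)}}\, G\left(\frac{x - x(t)}{\sqrt{v(t)}}\right) \tag{21}$$

$$x(t) = x_0 + \sqrt{\frac{2}{\pi}}\, t \tag{22}$$

$$v(t) = v_0 + 4 \int_0^t dt'\, p_e(t') \tag{23}$$

$$p_e(t) = \frac{1}{\pi} \arccos\left(q(t)\right) \tag{24}$$

$$q(t) = \mathrm{erf}\left(\frac{x(t)}{\sqrt{2 v(t)}}\right) \tag{25}$$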

Thus, the gaussian shape of the distribution is preserved, but its center and its variance evolve in time: 
the center moves to the right at constant speed, while the variance derivative is proportional to the error rate. 
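
These equations are easy to integrate numerically. The following Python sketch uses a plain forward Euler scheme and assumes the BPI system in the form $x(t) = x_0 + \sqrt{2/\pi}\, t$, $\dot{v}(t) = 4 p_e(t)$, $p_e = \arccos(q)/\pi$, $q = \mathrm{erf}\left(x/\sqrt{2v}\right)$; the function name, the step size and the small initial variance are arbitrary choices made for this illustration.

```python
import math

def bpi_density_evolution(t_max, dt=1e-3, x0=0.0, v0=1e-4):
    """Forward-Euler integration of the BPI center/variance equations.

    Returns a list of (t, q, p_e) tuples, one per time step.
    """
    x, v, t = x0, v0, 0.0
    drift = math.sqrt(2.0 / math.pi)   # constant speed of the center
    trajectory = []
    while t < t_max:
        q = math.erf(x / math.sqrt(2.0 * v))   # overlap from the gaussian
        p_e = math.acos(q) / math.pi           # error rate from the overlap
        trajectory.append((t, q, p_e))
        x += drift * dt                        # center moves at constant speed
        v += 4.0 * p_e * dt                    # variance grows with the error rate
        t += dt
    return trajectory
```

In this sketch the overlap starts at 0 (error rate 0.5) when `x0=0.0` and increases as the center of the gaussian moves away from the origin faster than its width grows.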
Fig. 1: Evolution of the histograms with time (dark lines to light lines, taken in time steps of $\Delta t = 3$, from
$t = 1$ to $t = 25$), in simulations with four different algorithms (500 samples at $N = 32001$). In
panels a and b, the positive and negative sides of the curve obey different differential equations; in
the CP algorithm there's no drift term on the right side, and thus the majority of the synapses stays
near zero, causing a significant fraction of the synapses to be pushed back to the negative side. The
distributions are gaussians for the unbounded BPI algorithm (panel c), while setting a boundary
makes the histograms accumulate at the boundary (panel d). In all cases, the initial distribution was
random, with all the synapses at $h = \pm 1$.
Convergence is thus guaranteed, since the variance can grow at most linearly, which means that the width
of the distribution can grow at most as $\sqrt{t}$, while the center's speed is constant. Thus, for sufficiently large
times, the negative tail of the distribution, which determines the error rate ($p_e \sim \sqrt{1 - q}$ when $q \to 1$), will
be so small that the variance will almost be constant, and this in turn implies that the error rate decreases
exponentially with time. If we define the convergence time $T_c$ as the time by which the number of wrong
synapses becomes less than 1, i.e. when $N q_- \sim 1$, we find that asymptotically $T_c \sim \sqrt{\log N}$, which means that
the non-rescaled convergence time is almost linear with the number of synapses.
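
The scaling of the convergence time can be made explicit with a short estimate. The following is only a heuristic sketch, assuming that at late times the variance has saturated to a constant value $v_\infty$ (a symbol introduced here), that $x_0$ can be neglected, and that prefactors of the gaussian tail can be dropped:

```latex
\begin{align*}
x(t) &\simeq \sqrt{\tfrac{2}{\pi}}\, t, \qquad v(t) \simeq v_\infty, \\
q_-(t) &= \tfrac{1}{2}\,\mathrm{erfc}\!\left(\frac{x(t)}{\sqrt{2 v_\infty}}\right)
        \sim \exp\!\left(-\frac{x(t)^2}{2 v_\infty}\right)
        = \exp\!\left(-\frac{t^2}{\pi v_\infty}\right), \\
N q_-(T_c) \sim 1 \;&\Longrightarrow\; T_c \sim \sqrt{\pi v_\infty \log N}, \\
\tau_c = N\, T_c \;&\sim\; N \sqrt{\log N}.
\end{align*}
```

Under these assumptions the number of presented patterns needed to converge grows as $N\sqrt{\log N}$, in agreement with the statement above.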

Fig. 2b shows the overlap and error rate as a function of time; the agreement of the analytical solution
with the simulation data is almost perfect, except when $q_-$ is very small, as shown in Fig. 2c.

8 Bounded hidden variables 

We can easily introduce a limit over the number of available hidden states, by setting a maximum value 
$c\sqrt{N}$ for the modulus of $h$. Obviously, if $c$ is too small the algorithm's performance is impaired, while if
$c$ is large enough it has no effect; in between, the behavior depends on the value of $k$. It turns out that
setting a boundary over $h$ can effectively improve performance for BPI ($k = 2$), but it has almost no effect for
the optimal SBPI algorithm, with $k \simeq 0.8$, similarly to what happens in the classification problem scenario
studied in [1]: in fact, the optimum in this case was found to occur with $k = 1.2$, at $c = 2.5$. The results are
summarized in Fig. 3. An example of bounded histogram evolution is in the last panel of Fig. 1.

9 Teacher with continuous synaptic weights 

The above results were derived in the scenario of the generalization of a learnable rule, the desired output 
being provided by a binary perceptron. In this section, we consider instead the case of a non-learnable rule,
provided by a perceptron with continuous synaptic weights extracted at random from a uniform distribution
in the range $[-1, 1]$: the minimum generalization error is no longer 0 in this case, and our previous mean-field
approach is not able to provide a simple analytic solution; however, eq. (1) still holds true (in the limit of large
$N$), and hence the best possible assignment of the student's weights is obtained by taking the sign of the
teacher's weights, in which case the generalization error is equal to 1/6.
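
The value 1/6 can be recovered with a short computation, sketched below in Python. It assumes that the overlap entering $p_e = \arccos(q)/\pi$ is the cosine of the angle between the teacher weights and their signs, which in the large $N$ limit becomes $q = E|w| / \sqrt{E[w^2]}$ for weights uniform in $[-1, 1]$; the function name is hypothetical.

```python
import math

def best_binary_error_uniform_teacher():
    """Minimum generalization error of a binary student whose weights are
    the signs of a teacher with weights uniform in [-1, 1] (large N)."""
    mean_abs = 0.5              # E|w| for w uniform in [-1, 1]
    second_moment = 1.0 / 3.0   # E[w^2] for w uniform in [-1, 1]
    q = mean_abs / math.sqrt(second_moment)   # normalized overlap, sqrt(3)/2
    return math.acos(q) / math.pi             # error rate from the overlap
```

Since $\arccos\left(\sqrt{3}/2\right) = \pi/6$, the function returns 1/6 up to floating-point rounding.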

Our simulations show (Fig. 4a) that even in this case the SBPI algorithm outperforms the CP algorithm
when the parameter $p_s$ is chosen in the appropriate range, and that there exists an optimal value for $p_s$ such
that the generalization error rapidly gets very close to the optimal value, even though the optimum is reached
in exponential time (arguably due to the fact that the region around the solution is very flat in this case,
because some of the teacher's weights are so small that their inference is both very difficult and not very
relevant).

One important difference between this case and the previous one is that both the optimal and the
maximum value of the parameter $p_s$ (the maximum value is the one above which the performance becomes equal
to or worse than that of CP) are not fixed with varying $N$: rather, they both scale following the same power
law (Fig. 4b).

10 A simplified algorithm: CP+R 

We have shown in Sec. 5 that, in the limit of a large number of synapses $N$ and in the context of the
generalization learning of a learnable rule, the effect of the additional rule which distinguishes the SBPI 
algorithm from the CP algorithm, and which is responsible for the superior performance of the former with 
Fig. 2: Comparison between simulations (light solid lines), histogram evolution (solid lines) and continuous
probability density evolution (dark dotted lines), for three different algorithms (red: CP, green: SBPI
with $k = 0.8$, blue: BPI), at different times. The curves were taken at $N = 32001$, and initialized
as for Fig. 1. The agreement between the simulations and the two analytical predictions is almost
perfect, except when $q_-$ is very small. a. Histograms at different times. The analytical curves are
not available for SBPI at $t = 10$ since at that point the algorithm has already converged and the
approximations used are no longer valid. b. Average overlap $q$ (top curves, starting from 0) and error
rate $p_e$ (bottom curves, starting from 0.5) vs time. c. Fraction of wrong synapses $q_-$ vs time, in
logarithmic scale. This can be used as an estimate of the convergence time with $N$; the BPI curve is
fit asymptotically by a curve which goes like $t \propto \sqrt{\log\left(q_-^{-1}\right)}$ (not shown).
Fig. 3: Solving time with different values of the rescaled boundary $c$. Each point represents the average over
80 samples at $N = 32001$ (standard deviations are smaller than the point size). The overall optimum
is found at $c = 2.5$ with $k = 1.2$. These results are consistent with those found with $N = 64001$ (not
shown).




Fig. 4: Simulation results using a teacher with continuous weights, a. Average error rate p s vs time for 
N = 501, 1000 samples (dotted black line: minimum possible error; red dashed curve: CP; green solid 
curve: SBPI with optimal p s = 0.8). The learning curves scale with N following the same power law 
shown in the next panel, b. Scaling of the p s parameter with N, in logarithmic scale, and best fit. 
The fitting curves have the form aN~ b , the fitting parameters are a — 9.9 ± 0.8, b — 0.403 ± 0.008 
(optimal) and a = 20.5 ± 2.4, b = 0.400 ± 0.013 (maximum). 
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respect to the latter, is on average equivalent to applying an unspecific, constant and low-rate meta-plastic 
reinforcement to all the synapses (see eq. [3]) . This reinforcement process is only effective if it is not too strong, 
because otherwise it would overcome the effect of the learning by keeping all of the synapses away from the 
plastic transition boundary (i.e. away from x = 0, in the notation of Sec. [6J. 

This suggests that the SBPI algorithm can be further simplified, leading to a "clipped perceptron plus 
reinforcement" algorithm (CP+R), i.e. the CP algorithm with the additional prescription that, at each time 
step τ, each synaptic weight undergoes a meta-plastic transition $h_i^\tau \to h_i^\tau + 2\,\mathrm{sign}(h_i^\tau)$ 
with probability $p_r$, where $0 < p_r < \sqrt{2/(\pi N)}$ (the time step index τ does not increment in the 
reinforcement process, because it is superimposed on the standard learning rules and acts in parallel with 
them). Any value of $p_r$ greater than 0 makes a qualitative difference with respect to CP. 
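
The prescription above can be illustrated with a minimal simulation sketch. It is not the code used for the results in this paper, and it rests on several assumptions which are not spelled out in this section: the hidden states are odd integers $h_i$ with binary weights $w_i = \mathrm{sign}(h_i)$; the CP rule shifts every hidden state by $2\xi_i\sigma_T$ whenever the student's output differs from the teacher's output $\sigma_T$ on the pattern ξ; the teacher has binary weights and the patterns are random ±1 vectors. The function names, the initial condition and the default value of `p_r` (an arbitrary fraction of the assumed bound $\sqrt{2/(\pi N)}$) are hypothetical choices.

```python
import numpy as np


def cp_plus_r_step(h, xi, sigma_teacher, p_r, rng):
    """Apply one CP+R step in place; return True if the student made an error.

    h             : odd-integer hidden states (assumed); the weights are sign(h)
    xi            : input pattern with entries +1/-1
    sigma_teacher : desired output, +1 or -1
    p_r           : per-synapse reinforcement probability
    """
    w = np.sign(h)                          # odd h is never 0, so w is +1/-1
    sigma = 1 if int(w @ xi) > 0 else -1    # student output (no ties when N is odd)
    error = sigma != sigma_teacher

    delta = np.zeros_like(h)
    if error:
        # Assumed CP rule: on a mistake, move every hidden state by 2 in the
        # direction that favours the teacher's output.
        delta += 2 * xi * sigma_teacher

    # Reinforcement: each synapse independently, with probability p_r, moves
    # 2 units away from the boundary. It is computed from the same state as
    # the CP update, so that the two rules act in parallel.
    reinforced = rng.random(h.size) < p_r
    delta += 2 * np.sign(h) * reinforced

    h += delta
    return bool(error)


def run_cp_plus_r(N=1001, p_r=None, max_steps=200_000, seed=0):
    """Present random patterns until the student's weights equal the teacher's."""
    rng = np.random.default_rng(seed)
    if p_r is None:
        p_r = 0.3 * np.sqrt(2.0 / (np.pi * N))  # arbitrary fraction of the assumed bound
    teacher = rng.choice([-1, 1], size=N)
    h = rng.choice([-1, 1], size=N)             # hypothetical initial condition
    for step in range(1, max_steps + 1):
        xi = rng.choice([-1, 1], size=N)
        sigma_teacher = 1 if int(teacher @ xi) > 0 else -1
        cp_plus_r_step(h, xi, sigma_teacher, p_r, rng)
        if np.array_equal(np.sign(h), teacher):
            return step, h
    return None, h                              # not solved within max_steps


if __name__ == "__main__":
    steps, h = run_cp_plus_r(N=201, max_steps=20_000)
    print("steps needed to match the teacher:", steps)
```

Setting `p_r = 0` reduces the step to plain CP under the same assumptions, which makes it easy to compare the two algorithms on the same sequence of patterns.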

The CP+R algorithm is only equivalent to the SBPI algorithm in the scenario of the generalization of a 
learnable rule. Indeed, in the case of the non-learnable rule of Sec. 9, the relationship of eq. [3] no longer 
holds; however, the CP+R algorithm still proves as effective as SBPI when the parameter $p_r$ is properly set 
(not shown). 

In the classification problem, on the other hand, the performance of CP+R is worse than that of SBPI in 
terms of capacity by a factor of order 2. However, our preliminary results show that in this scenario the 
difference between the two algorithms shows up only in the late phases of the learning (when the 
temporal correlations in the inputs make a difference), and that simply reducing the rate of application of 
the reinforcement process $p_r$ during the learning, along with the error rate, is sufficient to recover the SBPI 
performance even in that case. This will be the subject of a future work. 

From the architectural point of view, the CP+R algorithm is even simpler than the SBPI algorithm 
(which was already derived as a crude simplification of the Belief Propagation algorithm); thus, it may be 
an even better candidate for modelling supervised learning in biological networks, which have very strict 
requirements about robustness, simplicity and effectiveness. Its only serious drawback with respect to SBPI 
is that the random reinforcement must be applied sparingly, since the probability is of order $O(1/\sqrt{N})$, 
which would require some fine-tuning mechanism of the cell's behaviour; SBPI, on the other hand, requires 
detection of near-threshold events in order to trigger the reinforcement rule, which may also be problematic. 
Furthermore, even if the learning rate under CP+R is sub-optimal with respect to the generalization protocol 
problem, its extreme simplicity and robustness might be attractive for hardware implementations of binary 
perceptron units with a very large number of synapses as well, because it is adaptable to both the classification 
and the generalization scenarios, and, even in the latter (algorithmically easier) case, it greatly reduces the 
overhead associated with the complex computations required by the faster algorithms, while still having a 
very good scaling behaviour with N, as the steps required grow at most as $O(N\sqrt{\log N})$. 
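
To give a feeling for these orders of magnitude, the following sketch evaluates the upper bound on the reinforcement rate and the growth law of the number of steps for the two sizes used in the simulations. It assumes that the bound is $\sqrt{2/(\pi N)}$ (our reading of the value quoted in the abstract) and it treats $N\sqrt{\log N}$ purely as a growth law, since the prefactor is not specified here; the function names are hypothetical. With this bound, the expected number of reinforced synapses per step, $N p_r$, stays below $\sqrt{2N/\pi}$, i.e. about 143 out of 32001 synapses.

```python
import math


def reinforcement_rate_bound(N):
    """Assumed upper bound on the per-synapse reinforcement probability."""
    return math.sqrt(2.0 / (math.pi * N))


def solving_steps_scale(N):
    """Growth law N * sqrt(log N) of the number of steps (unknown prefactor omitted)."""
    return N * math.sqrt(math.log(N))


if __name__ == "__main__":
    for N in (32001, 64001):
        bound = reinforcement_rate_bound(N)
        print(f"N={N}: p_r < {bound:.5f}, "
              f"reinforced synapses per step < {N * bound:.1f}, "
              f"steps scale as {solving_steps_scale(N):.3e} (up to a constant)")
```

Doubling N multiplies the growth law by slightly more than 2, which is the sense in which the scaling is almost linear in the number of synapses.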

11 Summary 

In this paper, we have studied analytically and through numerical simulations the SBPI algorithm dynamics 
in the supervised generalization learning scenario and in the limit of a large number of synapses N. 

The original goal, namely clarifying the role of the novel learning rule introduced by this 
algorithm, was approached by studying the average dynamics of the internal synaptic state and separating 
the contributions due to the different learning rules, which allowed us to derive a partial differential equation 
describing the learning process in terms of a diffusion process. The solution of this equation in a (non-optimal) 
special case provided us with an estimate for the learning time, which turned out to scale as $N\sqrt{\log N}$. The 
analytical predictions were found to be in excellent agreement with the numerical simulations. 

We have also obtained some results from simulations under circumstances in which the previous analytical 
approach failed, and found that the SBPI algorithm can be further optimized by properly setting a hard 
boundary to the number of internal synaptic states (scaling as $\sqrt{N}$), which confirms our previous results in 
the context of classification learning, and that its enhanced effectiveness with respect to CP is not limited to 
learnable rules. 

The analytical results, together with their interpretation in terms of the synaptic states' dynamics, have 
also suggested the introduction of a novel, simplified algorithm, called CP+R, which proved in our preliminary 
results to be as effective as SBPI in all the circumstances in which we have tested it (with some 
minor adjustments), making it a good candidate for biological and electronic implementations, and which 
will be the subject of a future work. 

Acknowledgements I thank A. Braunstein, N. Brunel and R. Zecchina for helpful discussions and comments 
on the manuscript. 